These are code snippets and a short summary of the fourth chapter, Geocentric Models, of the book “Statistical Rethinking” (version 2) by Richard McElreath.

Why normal distributions are normal

The chapter discusses linear models and starts with a recap of the normal distribution: why is it such a commonly used distribution, and how does it arise?

library(rethinking)

# 1000 replicates of the sum of 16 uniform fluctuations: sums converge to a Gaussian
pos <- replicate( 1000, sum( runif(16, -1, 1)))
dens(pos, norm.comp = T)

Normality also arises from multiplying many small growth factors, since multiplying numbers close to 1 is approximately the same as adding small deviations:

growth <- replicate(1000, prod(1 + runif(12, 0, 0.1)))
dens(growth, norm.comp = T)

This fails if the growth factors are too large:

big <- replicate(1000, prod(1 + runif(12, 0, 0.5)))
dens(big, norm.comp = T)

The smaller the factor, the better the approximation:

small <- replicate(1000, prod(1 + runif(12, 0, 0.0001)))
dens(small, norm.comp = T)

On the log scale, however, even large multiplied deviates tend toward a Gaussian, since the log turns multiplication into addition:

log.big <- replicate(1000, log( prod( 1 + runif(12, 0, 0.5))))
dens(log.big, norm.comp = T)

There are broadly two reasons why we use Gaussian distributions in modelling: (1) ontological: we have reason to believe the process we’re modelling actually follows a Gaussian distribution, for the reasons laid out above. E.g. measurement errors follow a Gaussian because, at heart, they arise through a process of added fluctuations. (2) epistemological: we don’t really know much about our process, except that it has a mean and a variance, so we use a normal distribution because it makes the fewest assumptions. (If we know more, we should probably use a different distribution!)

A Gaussian model of height

For the example, we’ll be using the !Kung data to estimate height.

A short look at the data:

# The !Kung Data
data("Howell1")
d <- Howell1
str(d)
## 'data.frame':    544 obs. of  4 variables:
##  $ height: num  152 140 137 157 145 ...
##  $ weight: num  47.8 36.5 31.9 53 41.3 ...
##  $ age   : num  63 63 65 41 51 35 32 27 19 54 ...
##  $ male  : int  1 0 0 1 0 1 0 1 0 1 ...
precis(d)
##               mean         sd      5.5%     94.5%     histogram
## height 138.2635963 27.6024476 81.108550 165.73500 ▁▁▁▁▁▁▁▂▁▇▇▅▁
## weight  35.6106176 14.7191782  9.360721  54.50289 ▁▂▃▂▂▂▂▅▇▇▃▂▁
## age     29.3443934 20.7468882  1.000000  66.13500     ▇▅▅▃▅▂▂▁▁
## male     0.4724265  0.4996986  0.000000   1.00000    ▇▁▁▁▁▁▁▁▁▇

Since height strongly correlates with age before adulthood, we’ll only use adults for the following analysis:

d2 <- d[ d$age >= 18 ,]
dens(d2$height)

The distribution of height then looks approximately Gaussian.

Our model looks as follows: \[\begin{align*} h_i &\sim \text{Normal}(\mu, \sigma)\\ \mu &\sim \text{Normal}(178, 20) \\ \sigma &\sim \text{Uniform}(0, 50) \end{align*}\]

To better understand the assumptions we’re making with our priors, let’s have a short look at them:

curve( dnorm( x, 178, 20 ), from=100, to=250) 

curve( dunif( x, 0, 50), from=-10, to=60) 

To better understand what these priors mean, we can do a prior predictive simulation. That is, we sample from our priors and plug the samples into the likelihood to find out what the model thinks would be reasonable observations, just based on the priors, before having seen the data.

sample_mu <- rnorm(1e4, mean=178, sd=20)
sample_sigma <- runif(1e4, 0, 50)
prior_height <- rnorm(1e4, mean=sample_mu, sd=sample_sigma)
dens(prior_height)

The model expects people to have a height between 100 and 250cm. A wide range, but it seems reasonable. Compare this with the prior predictive we would get from flat priors (changing the standard deviation of the \(\mu\) prior to 100):

sample_mu <- rnorm(1e4, mean=178, sd=100)
prior_height <- rnorm(1e4, mean=sample_mu, sd=sample_sigma)
dens(prior_height)

The flat prior thinks people could have negative height (to the left of the dashed line) or be extremely tall (the right line indicates the height of one of the tallest persons ever recorded). Not very sensible.

For educational purposes, let’s do a grid approximation of our model.

Start with the grid:

mu.list <- seq(from=150, to=160, length.out = 200)
sigma.list <- seq(from=7, to=9, length.out = 200)
post <- expand.grid(mu=mu.list, sigma=sigma.list)

For each value in the grid, compute the log-likelihood for the whole data:

post$LL <- sapply( 1:nrow(post), function(i) sum( dnorm( 
    d2$height,                                            
    mean=post$mu[i],                                    
    sd=post$sigma[i],                                 
    log=TRUE                                            
)))

Since we’re working with the log-likelihood, we can sum the terms instead of multiplying them. This is mainly to avoid numerical underflow. In the same way, we can add the log prior densities:

post$prod <- post$LL + dnorm( post$mu, 178, 20, TRUE) +   # add log likelihood value to log prior values (corresponds to multiplying)
  dunif( post$sigma, 0, 50, TRUE)

Finally, we transform back from the log scale, but first subtract the maximum to avoid underflow from very small values:

post$prob <- exp( post$prod - max(post$prod))

Because of the scaling, the resulting values are actually not proper probabilities but relative probabilities.
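A tiny self-contained sketch (with simulated numbers, not the Howell data) shows why both the log scale and the max-subtraction trick are needed:

```r
# Total log-likelihood of 352 simulated heights: each observation
# contributes roughly -3.5, so the total is a large negative number
set.seed(1)
ll <- sum( dnorm( rnorm(352, 155, 8), mean=155, sd=8, log=TRUE ) )
ll               # around -1230
# Exponentiating directly underflows to exactly 0 in double precision...
exp( ll )
# ...but after subtracting the maximum (here, ll itself), values near the
# peak of the posterior stay representable:
exp( ll - ll )   # the maximum maps to 1
```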

We can visualize the distribution using a contour map:

contour_xyz( post$mu, post$sigma, post$prob)

Or a heatmap:

image_xyz( post$mu, post$sigma, post$prob)

The highest probability density for \(\mu\) is somewhere around 155, with the highest probability density for \(\sigma\) around 8.

Instead of working with the grid posterior, we can use the grid to obtain a sample from our posterior:

# sample from posterior
sample.rows <- sample( 1:nrow(post), size=1e4, replace=TRUE,
                       prob=post$prob)
sample.mu <- post$mu[ sample.rows ]
sample.sigma <- post$sigma[ sample.rows]

We can plot these samples instead:

plot(sample.mu, sample.sigma, cex=0.5, pch=16, col=col.alpha(rangi2, 0.1))

In the plot, we can still see, as artifacts, the grid we used to approximate the posterior. One could either use a finer grid or add a tiny bit of jitter to the samples to get rid of these artifacts.
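A sketch of the jitter option, assuming the `sample.mu` and `sample.sigma` vectors from above; the jitter widths are half a grid cell (the grid spacing is 10/200 = 0.05 for \(\mu\) and 2/200 = 0.01 for \(\sigma\)):

```r
# Spread each sample uniformly over its grid cell so the discrete
# grid lines disappear from the scatter plot
mu.jit <- sample.mu + runif( length(sample.mu), -0.025, 0.025 )
sigma.jit <- sample.sigma + runif( length(sample.sigma), -0.005, 0.005 )
plot( mu.jit, sigma.jit, cex=0.5, pch=16, col=col.alpha(rangi2, 0.1) )
```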

We can also look at the marginal posterior:

dens( sample.mu )

As sample size increases, posterior densities approach the normal. Indeed, the posterior for \(\mu\) looks very normal already. The posterior for \(\sigma\) on the other hand is very slightly right-skewed:

dens( sample.sigma )

We can compute posterior compatibility intervals:

PI( sample.mu )
##       5%      94% 
## 153.9196 155.2764
PI( sample.sigma)
##       5%      94% 
## 7.311558 8.246231
Short note on why the standard deviation is less normal:

Posteriors of \(\sigma\) tend to have a long right-hand tail. The reasons are complex, but an intuition is this: because the standard deviation must be positive, there is less uncertainty about how small it can be (it is bounded below by 0) and more uncertainty about how large it can be. If we repeat the example from above on a small subset of the data, this tail in the uncertainty about \(\sigma\) becomes more pronounced:

There’s a longer tail at the top of this point-cloud. We can also see it in the marginal density of \(\sigma\):
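That experiment can be reproduced by rerunning the grid-approximation code from above on a small subset — a sketch in which the subset size (20) and the new variable names (`d3`, `post2`, `sample2.*`) are my own choices:

```r
# Repeat the grid approximation on just 20 sampled heights
d3 <- sample( d2$height, size=20 )
mu.list <- seq( from=150, to=170, length.out=200 )
sigma.list <- seq( from=4, to=20, length.out=200 )
post2 <- expand.grid( mu=mu.list, sigma=sigma.list )
post2$LL <- sapply( 1:nrow(post2), function(i)
  sum( dnorm( d3, mean=post2$mu[i], sd=post2$sigma[i], log=TRUE ) ) )
post2$prod <- post2$LL + dnorm( post2$mu, 178, 20, TRUE ) +
  dunif( post2$sigma, 0, 50, TRUE )
post2$prob <- exp( post2$prod - max(post2$prod) )
sample2.rows <- sample( 1:nrow(post2), size=1e4, replace=TRUE, prob=post2$prob )
sample2.mu <- post2$mu[ sample2.rows ]
sample2.sigma <- post2$sigma[ sample2.rows ]
# Point cloud: note the longer tail at the top
plot( sample2.mu, sample2.sigma, cex=0.5, col=col.alpha(rangi2, 0.1) )
# Marginal density of sigma, compared with a normal curve
dens( sample2.sigma, norm.comp=TRUE )
```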

Since grid approximation becomes infeasible for more complex models, we now switch to quadratic approximation. As we’ve seen above, the posterior approaches a normal distribution with enough data, so we can exploit that to approximate it. The quadratic approximation, quap(), first climbs the posterior distribution to find its mode, the MAP (maximum a posteriori) estimate, and then estimates the quadratic curvature around it.
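What quap() does under the hood can be sketched with base R’s optim(): minimize the negative log-posterior to find the MAP, then invert the Hessian there to get the covariance of the Gaussian approximation. This assumes `d2` from above; the function name `neg_log_post` is my own:

```r
# Negative log-posterior of the height model, up to a constant
neg_log_post <- function(par) {
  mu <- par[1]; sigma <- par[2]
  if (sigma <= 0) return(Inf)   # stay inside the prior's support
  -( sum( dnorm( d2$height, mean=mu, sd=sigma, log=TRUE ) ) +
     dnorm( mu, 178, 20, log=TRUE ) +
     dunif( sigma, 0, 50, log=TRUE ) )
}
fit <- optim( c(170, 10), neg_log_post, hessian=TRUE )
fit$par                # the MAP, close to quap's posterior means
solve( fit$hessian )   # covariance matrix of the quadratic approximation
```

The diagonal of that covariance matrix should roughly match the squares of the sd column reported by precis() below.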

To use quap(), we place our model into a formula list (created with alist() in R):

flist <- alist(
  height ~ dnorm( mu, sigma) ,
  mu ~ dnorm( 178, 20),
  sigma ~ dunif(0, 50)
)

m4.1 <- quap( flist, data=d2 )
precis( m4.1 )
##             mean        sd       5.5%      94.5%
## mu    154.607226 0.4120481 153.948693 155.265758
## sigma   7.732336 0.2914806   7.266494   8.198178

Comparing this with the compatibility intervals from the grid approximation model before, we see that they’re nearly identical:

PI( sample.mu )
##       5%      94% 
## 153.9196 155.2764
PI( sample.sigma )
##       5%      94% 
## 7.311558 8.246231

Let’s see what happens if we use a very narrow prior on \(\mu\) instead:

m4.2 <- quap( 
          alist(
            height ~ dnorm(mu, sigma),
            mu ~ dnorm(178, 0.1),
            sigma ~ dunif(0, 50)
          ), 
          data=d2)
precis(m4.2)
##            mean        sd      5.5%     94.5%
## mu    177.86375 0.1002354 177.70356 178.02395
## sigma  24.51756 0.9289235  23.03297  26.00216

Now the estimate for \(\mu\) has hardly moved off the prior. Note, however, how much the posterior for \(\sigma\) changed, even though we didn’t change its prior. Since we told the model very convincingly that \(\mu\) is basically 178, it adjusts the estimate of \(\sigma\) in return to make the data fit the model.

To sample from the quadratic approximation, note that the quadratic approximation of a posterior distribution is just a multi-dimensional Gaussian distribution. To sufficiently describe a multidimensional Gaussian distribution, we only need a list of means and a matrix of variances and covariances.

vcov( m4.1 )    # variance-covariance matrix
##               mu     sigma
## mu    0.1697837 0.0002226
## sigma 0.0002226 0.0849609

We can decompose the matrix of variances and covariances into a vector of variances and a correlation matrix:

diag( vcov(m4.1 )) 
##         mu      sigma 
## 0.16978365 0.08496091
cov2cor( vcov( m4.1 )) 
##              mu     sigma
## mu    1.0000000 0.0018533
## sigma 0.0018533 1.0000000

We could use the variances and correlation matrix to sample from a multi-dimensional Gaussian distribution. Or simply use the provided shortcut:

post <- extract.samples( m4.1, n=1e4)
head(post)
##         mu    sigma
## 1 154.8929 8.052744
## 2 154.7968 7.470122
## 3 154.3875 7.852494
## 4 155.1543 7.621551
## 5 154.5696 7.920499
## 6 155.7326 7.819466
precis(post)
##             mean        sd       5.5%      94.5%    histogram
## mu    154.608259 0.4107996 153.952949 155.261994     ▁▁▁▅▇▂▁▁
## sigma   7.731543 0.2907791   7.265869   8.198561 ▁▁▁▂▅▇▇▃▁▁▁▁

Equivalently, we could also get the posterior samples manually as follows:

library(MASS)
post <- mvrnorm(n = 1e4, mu=coef(m4.1), Sigma=vcov(m4.1))

Full code.